21. Implementation
Implementation: Truncated Policy Iteration
In the previous concept, you learned about truncated policy evaluation. Whereas (iterative) policy evaluation applies as many Bellman updates as needed to attain convergence, truncated policy evaluation only performs a fixed number of sweeps through the state space.
The pseudocode can be found below.
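As a complement to the pseudocode, here is a minimal Python sketch of truncated policy evaluation. It assumes a Gym-style environment, as in the notebook's FrozenLake setup, where env.P[s][a] is a list of (prob, next_state, reward, done) tuples, env.nS and env.nA give the number of states and actions, and policy is an (env.nS, env.nA) array of action probabilities; the parameter name max_it is illustrative.

```python
def truncated_policy_evaluation(env, policy, V, max_it=1, gamma=1.0):
    """Apply at most max_it in-place sweeps of the Bellman expectation update.

    Sketch only: assumes env.P[s][a] yields (prob, next_state, reward, done)
    tuples and policy[s] holds action probabilities for state s.
    """
    counter = 0
    while counter < max_it:
        for s in range(env.nS):
            v = 0
            # one-step lookahead: expected return when following the policy
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + gamma * V[next_state])
            V[s] = v
        counter += 1
    return V
```

The only difference from (iterative) policy evaluation is the outer counter: the loop performs exactly max_it sweeps instead of running until the value estimate stops changing.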

We can incorporate this amended policy evaluation algorithm into an algorithm similar to policy iteration, called truncated policy iteration.
The pseudocode can be found below.
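Again as a complement to the pseudocode, a sketch of truncated policy iteration under the same assumptions is shown below. It reuses truncated_policy_evaluation from the previous snippet, and the parameter names theta and max_it are illustrative.

```python
import numpy as np

def q_from_v(env, V, s, gamma=1.0):
    # one-step lookahead: action values for state s under the estimate V
    q = np.zeros(env.nA)
    for a in range(env.nA):
        for prob, next_state, reward, done in env.P[s][a]:
            q[a] += prob * (reward + gamma * V[next_state])
    return q

def truncated_policy_iteration(env, max_it=1, gamma=1.0, theta=1e-8):
    """Sketch: alternate greedy policy improvement with truncated evaluation,
    stopping once the value function estimate has (approximately) converged."""
    V = np.zeros(env.nS)
    policy = np.ones([env.nS, env.nA]) / env.nA  # start from the uniform random policy
    while True:
        # policy improvement: act greedily with respect to the current V
        for s in range(env.nS):
            best_a = np.argmax(q_from_v(env, V, s, gamma))
            policy[s] = np.eye(env.nA)[best_a]
        # truncated policy evaluation: at most max_it sweeps
        V_old = V.copy()
        V = truncated_policy_evaluation(env, policy, V, max_it, gamma)
        # stop when the value estimate barely changes between iterations
        if np.max(np.abs(V - V_old)) < theta:
            break
    return policy, V
```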

You may also notice that the stopping criterion for truncated policy iteration differs from that of policy iteration. In policy iteration, we terminated the loop when the policy was unchanged after a single policy improvement step. In truncated policy iteration, we stop the loop only when the value function estimate has converged.
You are strongly encouraged to try out both stopping criteria to build your intuition. However, note that checking for an unchanged policy is unlikely to work if the hyperparameter max_iterations is set too small. In that case, even if the algorithm is far from converging to the optimal value function v_* or the optimal policy \pi_*, the updates to the value function estimate V may be too small to change its corresponding greedy policy.
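If you want to experiment with the alternative stopping criterion, a variant of the sketch above (reusing q_from_v and truncated_policy_evaluation from the earlier snippets) might look like the following; the premature-termination risk described above is noted in the comments.

```python
import numpy as np

def truncated_policy_iteration_policy_stable(env, max_it=1, gamma=1.0):
    """Sketch of the alternative criterion: stop when the greedy policy is unchanged."""
    V = np.zeros(env.nS)
    policy = np.ones([env.nS, env.nA]) / env.nA
    while True:
        # greedy policy improvement from the current value estimate
        new_policy = np.zeros([env.nS, env.nA])
        for s in range(env.nS):
            best_a = np.argmax(q_from_v(env, V, s, gamma))
            new_policy[s] = np.eye(env.nA)[best_a]
        # With a small max_it, V barely moves each iteration, so the greedy
        # policy can stabilize long before V is close to v_*; this check may
        # then terminate prematurely.
        if np.array_equal(new_policy, policy):
            return new_policy, V
        policy = new_policy
        V = truncated_policy_evaluation(env, policy, V, max_it, gamma)
```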
Please use the next concept to complete Part 5: Truncated Policy Iteration of Dynamic_Programming.ipynb. Remember to save your work!
If you'd like to reference the pseudocode while working on the notebook, you are encouraged to open this sheet in a new window.
Feel free to check your solution by looking at the corresponding section in Dynamic_Programming_Solution.ipynb.